In this project we analyse data on the selling prices of houses in Melbourne, Australia. Prices are in Australian dollars (AUD).
The main objective is to better understand how house prices behave in this city using the available data. I obtained the data from Kaggle, at this link: https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market?resource=download.
house = read.csv('Melbourne_housing_FULL.csv', header = TRUE, sep = ",")
colnames(house)
## [1] "Suburb" "Address" "Rooms" "Type"
## [5] "Price" "Method" "SellerG" "Date"
## [9] "Distance" "Postcode" "Bedroom2" "Bathroom"
## [13] "Car" "Landsize" "BuildingArea" "YearBuilt"
## [17] "CouncilArea" "Lattitude" "Longtitude" "Regionname"
## [21] "Propertycount"
The dataset has 21 columns.
missing_values <- sapply(house, function(x) sum(is.na(x)))
print(missing_values)
## Suburb Address Rooms Type Price
## 0 0 0 0 7610
## Method SellerG Date Distance Postcode
## 0 0 0 0 0
## Bedroom2 Bathroom Car Landsize BuildingArea
## 8217 8226 8728 11810 21115
## YearBuilt CouncilArea Lattitude Longtitude Regionname
## 19306 0 7976 7976 0
## Propertycount
## 0
I removed the variables with a large number of NAs. Moreover, since the same postcode always corresponds to the same distance, Postcode carries the same information as Distance, so I deleted it as well.
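The exact cleaning code is not shown in the report; a minimal sketch of the NA-dropping step on a toy frame (hypothetical values, not the real `houses` data):

```r
# complete.cases() flags rows without any NA; subsetting on it drops the rest
df <- data.frame(Price = c(1000000, NA, 850000), Rooms = c(2, 3, NA))
clean <- df[complete.cases(df), ]   # only the first row survives
nrow(clean)
```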
colnames(houses)
## [1] "Rooms" "Type" "Price" "Distance" "Bathroom"
## [6] "Car" "Landsize" "Regionname"
These are the features I chose to keep:

- Rooms: the number of rooms in the house
- Type: the type of house
- Price: the selling price
- Distance: the distance from the city centre
- Bathroom: the number of bathrooms
- Car: the number of car spots
- Landsize: the land size in square metres
- Regionname: the name of the region
The objective of this investigation is to determine the a posteriori distribution of the parameters of a linear regression model. These parameters are the coefficients that describe the relationship between the predictor variables and the response variable, 'Price'. In the Bayesian approach, we start from initial beliefs expressed through a priori distributions, which are updated and refined as data are observed.
I have adopted Normal a priori distributions for the coefficients and a Gamma a priori distribution for the precision parameter known as tau. This choice is based on the conjugate nature of the Gamma distribution in Bayesian linear regression, ensuring consistency in the belief updating process when the likelihood is Gaussian.
A crucial element of this project is the analysis of the fit of the regression model, which involves the use of diagnostic outputs from Markov Chain Monte Carlo (MCMC) algorithms. These outputs provide insights into the convergence of the algorithm and its overall performance, helping to assess the reliability of the results obtained.
In the model building process, I initially include predictors carefully selected during the design phase, chosen to capture the aspects that most influence housing prices. Using the Bayesian approach, I conduct a detailed exploratory analysis with MCMC algorithms, paying special attention to point and interval estimates, which represent, respectively, the most likely and the plausible values of the model parameters.
I will then compare the initial model and a simplified version, which will only include the variable ‘rooms’, which is highly correlated with the variable ‘price’. The objective here is to determine which of the two models fits the data best, using the Deviance Information Criterion (DIC). This criterion assesses both the fit to the data and the complexity of the model.
I chose to use the programme JAGS (Just Another Gibbs Sampler) to perform the Bayesian analysis with MCMC algorithms. These algorithms generate a Markov chain that converges to the desired probability distribution; samples drawn from the chain are used to approximate the a posteriori distribution of the parameters. In summary, this project aims to reveal the relationships between the variables through the Bayesian approach, with an emphasis on diagnostic analysis and comparison between different models.
Let's verify that the cleaned dataset no longer contains missing values:
missing_values <- sapply(houses, function(x) sum(is.na(x)))
print(missing_values)
## Rooms Type Price Distance Bathroom Car Landsize
## 0 0 0 0 0 0 0
## Regionname
## 0
|   | n | mean | sd | median | min | max | skew | kurtosis | CoV | completeness |
|---|---|---|---|---|---|---|---|---|---|---|
| Rooms | 14012 | 3.18 | 0.9059209 | 3.0 | 1.0 | 12.0 | 0.4257323 | 1.874082 | 0.2848808 | 100 |
| Type* | 14012 | 1.34 | 0.6955705 | 1.0 | 1.0 | 3.0 | 1.7597348 | 1.372769 | 0.5190825 | 100 |
| Price | 14012 | 1060357.13 | 616379.8492569 | 885000.0 | 85000.0 | 11200000.0 | 2.4950554 | 14.377490 | 0.5812946 | 100 |
| Distance | 14012 | 13.04 | 6.1873296 | 11.4 | 6.1 | 48.1 | 1.9518057 | 5.110630 | 0.4744885 | 100 |
| Bathroom | 14012 | 1.62 | 0.7143837 | 2.0 | 0.0 | 9.0 | 1.2762430 | 3.873143 | 0.4409776 | 100 |
| Car | 14012 | 1.83 | 0.9887381 | 2.0 | 0.0 | 18.0 | 1.8037023 | 10.712379 | 0.5402940 | 100 |
| Landsize | 14012 | 603.55 | 2055.9402337 | 566.0 | 0.0 | 146699.0 | 43.2842086 | 2409.357396 | 3.4064124 | 100 |
| Regionname* | 14012 | 4.76 | 2.1632982 | 6.0 | 1.0 | 8.0 | -0.5615635 | -1.158177 | 0.4544744 | 100 |
The summary table above reports the mean, standard deviation, skewness, kurtosis, and coefficient of variation (CoV) for each feature. A skewness value above +1 or below -1 indicates a strongly skewed distribution, with one tail longer than the other. Similarly, a kurtosis above +1 suggests a sharply peaked distribution, with values concentrated narrowly around the centre. We therefore expect skewed and peaked patterns in features such as Price, Distance, Car, and Landsize.
Let’s have a look at how the response variable “Price” is distributed:
I chose to apply a logarithm transformation to the response variable ‘Price’: this variable has a high variance, and the transformation reduces it.
houses$Price <- log(houses$Price)
This is the distribution of Log(Price); it is what we expected.
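The effect of the log transform on a right-skewed variable can be checked on simulated lognormal data (a stand-in for the real prices, not the actual dataset):

```r
set.seed(1)
x <- rlnorm(10000, meanlog = 14, sdlog = 0.5)   # lognormal, price-like scale

# sample skewness (third standardized moment)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew(x)        # clearly positive: long right tail
skew(log(x))   # close to zero: the log restores symmetry
```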
Type and Regionname
In the box plot, type h spans the widest range of prices, and the black points mark several outliers; the medians of the three types also differ.
In the second plot, we can notice that houses with fewer rooms tend to have lower prices, although prices stay close to the mean; conversely, more rooms does not necessarily correspond to a higher price.
Rooms, Distance, Bathroom, Car, Landsize.
Houses with 3 rooms are the most common, while houses with 1 or 6 rooms are rarer.
Most houses lie at a distance of about 2 from the city centre.
Most houses have between 0 and 2.5 bathrooms; houses with more bathrooms are rarer.
Most observations have 2 car spots, and houses with more than 2 are rarer.
Most houses have a Landsize around 5 (on the plotted scale), with fewer observations at larger areas.
From the correlation matrix below we can observe that the features most correlated with Price are Rooms, Bathroom, and Car; Distance is negatively correlated, and Landsize is almost uncorrelated.
## Rooms Price Distance Bathroom Car Landsize
## Rooms 1.00000000 0.49613925 0.18392438 0.61144991 0.34337225 0.07495509
## Price 0.49613925 1.00000000 -0.25020316 0.42931205 0.24051406 0.05502003
## Distance 0.18392438 -0.25020316 1.00000000 0.09917488 0.13417768 0.13767849
## Bathroom 0.61144991 0.42931205 0.09917488 1.00000000 0.27382689 0.04405954
## Car 0.34337225 0.24051406 0.13417768 0.27382689 1.00000000 0.06114349
## Landsize 0.07495509 0.05502003 0.13767849 0.04405954 0.06114349 1.00000000
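A matrix like the one above is produced by `cor()` on the numeric columns; a self-contained toy sketch of the same call (hypothetical variables, not the housing data):

```r
set.seed(42)
toy <- data.frame(a = rnorm(200))
toy$b <- toy$a + rnorm(200, sd = 0.5)   # strongly related to a
toy$c <- rnorm(200)                     # unrelated noise
cm <- round(cor(toy), 2)                # pairwise Pearson correlations
cm["a", "b"]                            # large (strong correlation)
cm["a", "c"]                            # near zero
```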
In this context, our focus variable, Price (Y), is well-suited to be represented by a Normal distribution:
\(Y_i \sim N(\mu_i, \tau^{-1})\), where \(\tau\) is the precision (the inverse of the variance), following the parameterization used by JAGS.
To encapsulate the coefficients \(\beta_i\), which can span the entire real number range, we employ a Normal distribution. We assign a mean of 0 to each beta, making it easier to gauge whether a coefficient is significantly different from zero. Note that JAGS parameterizes the Normal by its precision, so a precision of 0.001 corresponds to a large variance (1000): a vague prior that constrains the coefficients very little. This choice is a result of experimentation and fine-tuning: \(\beta_i \sim N(\mu = 0, \tau = 0.001)\).
Similarly, we handle \(\tau\) through a Gamma distribution: \(\tau \sim \text{Gamma}(0.0001, 0.0001)\).
The Gamma distribution is suitable because it is defined on positive values, as required for \(\tau\), which represents the inverse of the variance.
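Since JAGS parameterizes the Normal by precision rather than variance, it is worth making the mapping explicit with a one-line check:

```r
# JAGS's dnorm(mu, tau) takes a precision tau; the implied sd is 1/sqrt(tau)
tau <- 0.001
sigma <- 1 / sqrt(tau)   # about 31.62: a vague prior, not a tight one
sigma
```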
The linear regression is a statistical technique used to model the relationship between a dependent variable (Y) and one or more independent variables (X1,…,Xn). In traditional linear regression, the goal is to estimate model parameters as point estimates, assuming that the response variable (y) is a linear combination of predictor variables (X) and weights (\(\beta\)), with a random error component (\(\epsilon\)).
The optimal weights are determined by minimizing the Residual Sum of Squares (RSS) using Ordinary Least Squares (OLS). However, this method doesn’t account for uncertainty in these estimates.
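As a quick illustration of the OLS step, `lm()` recovers the true coefficients from simulated data (hypothetical values, not the housing data):

```r
set.seed(7)
n <- 500
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)   # true intercept 2, true slope 3
fit <- lm(y ~ x)                       # OLS: minimizes the residual sum of squares
coef(fit)                              # point estimates close to (2, 3)
```

Note that OLS returns only these point estimates; the Bayesian approach below replaces them with full posterior distributions.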
Bayesian linear regression offers an alternative by treating model parameters as probability distributions. Unlike frequentist methods, Bayesian regression incorporates prior probability distributions about the parameters, leading to a probabilistic model.
The linear relationship between predictors and weights is maintained, with the response variable represented as \(Y \sim N(\beta^{T}X, \sigma^{2}I)\). The goal is to identify the posterior distribution of the weights, combining the priors with the observed data.
Choosing the correct priors is crucial. Two options are parameter-specific priors (based on existing knowledge) and general uninformative priors. Weakly informative priors are recommended when there’s a substantial amount of data available.
In constructing the models, the log-transformed, outlier-free data are used for more accurate estimation.
houses2$Regionname <- as.factor(houses2$Regionname)
houses2$Type <- as.factor(houses2$Type)
model_1_2 = lm(Price ~ Distance + Bathroom + Car + Landsize + Rooms + Regionname + Type, data = houses2)
summary(model_1_2)
##
## Call:
## lm(formula = Price ~ Distance + Bathroom + Car + Landsize + Rooms +
## Regionname + Type, data = houses2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57058 -0.16990 -0.00833 0.16609 2.34649
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 14.572312950 0.022084816 659.834
## Distance -0.522017330 0.007422510 -70.329
## Bathroom 0.097418148 0.004295878 22.677
## Car 0.033639618 0.002561636 13.132
## Landsize 0.000010157 0.000001162 8.745
## Rooms 0.145216230 0.003764114 38.579
## RegionnameEastern Victoria -0.064243916 0.028455912 -2.258
## RegionnameNorthern Metropolitan -0.376344473 0.007957399 -47.295
## RegionnameNorthern Victoria -0.273590076 0.028777810 -9.507
## RegionnameSouth-Eastern Metropolitan 0.051193629 0.012151544 4.213
## RegionnameSouthern Metropolitan 0.171746636 0.007776183 22.086
## RegionnameWestern Metropolitan -0.371732380 0.007893594 -47.093
## RegionnameWestern Victoria -0.547165296 0.035855928 -15.260
## Typet -0.241185651 0.009245203 -26.088
## Typeu -0.577516302 0.008309864 -69.498
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Distance < 0.0000000000000002 ***
## Bathroom < 0.0000000000000002 ***
## Car < 0.0000000000000002 ***
## Landsize < 0.0000000000000002 ***
## Rooms < 0.0000000000000002 ***
## RegionnameEastern Victoria 0.024 *
## RegionnameNorthern Metropolitan < 0.0000000000000002 ***
## RegionnameNorthern Victoria < 0.0000000000000002 ***
## RegionnameSouth-Eastern Metropolitan 0.0000254 ***
## RegionnameSouthern Metropolitan < 0.0000000000000002 ***
## RegionnameWestern Metropolitan < 0.0000000000000002 ***
## RegionnameWestern Victoria < 0.0000000000000002 ***
## Typet < 0.0000000000000002 ***
## Typeu < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2765 on 13997 degrees of freedom
## Multiple R-squared: 0.6955, Adjusted R-squared: 0.6952
## F-statistic: 2284 on 14 and 13997 DF, p-value: < 0.00000000000000022
The model predicts house prices from independent variables such as Distance and the number of Rooms. The coefficients show the estimated effect of each variable on (log) price. The model fits the data well, with a low residual standard error (0.2765) and a very small p-value for the F-statistic, indicating that the predictors are jointly significant.
I created a design matrix for the model to encode the qualitative variables as dummy indicators.
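The dummy coding can be sketched with `model.matrix()`; a toy example using the Type factor (levels h/t/u as in the dataset, toy rows):

```r
# The first factor level ("h") becomes the baseline absorbed by the intercept;
# the remaining levels get one 0/1 indicator column each (Typet, Typeu)
df <- data.frame(Type = factor(c("h", "t", "u", "h")))
mm <- model.matrix(~ Type, data = df)
colnames(mm)
```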
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 14012
## Unobserved stochastic nodes: 16
## Total graph size: 224971
##
## Initializing model
## Inference for Bugs model at "/var/folders/_s/9tvv9sbd3hld86_pfz5vn_d80000gn/T//RtmpYiwzdM/model59a7e5226e9.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 2000 discarded)
## n.sims = 24000 iterations saved
## mu.vect sd.vect 2.5% 25% 50%
## Beta_Bathroom 0.097 0.004 0.089 0.095 0.097
## Beta_Car 0.034 0.003 0.029 0.032 0.034
## Beta_Distance -0.522 0.007 -0.537 -0.527 -0.522
## Beta_Landsize 0.000 0.000 0.000 0.000 0.000
## Beta_Regionname_East -0.064 0.028 -0.120 -0.083 -0.064
## Beta_Regionname_North_Metro -0.376 0.008 -0.392 -0.382 -0.376
## Beta_Regionname_North_Vic -0.274 0.029 -0.330 -0.293 -0.274
## Beta_Regionname_SE_Metro 0.051 0.012 0.027 0.043 0.051
## Beta_Regionname_South_Metro 0.172 0.008 0.157 0.166 0.172
## Beta_Regionname_West_Metro -0.372 0.008 -0.387 -0.377 -0.372
## Beta_Regionname_West_Vic -0.547 0.036 -0.618 -0.571 -0.548
## Beta_Rooms 0.145 0.004 0.138 0.143 0.145
## Beta_Type_T -0.241 0.009 -0.259 -0.247 -0.241
## Beta_Type_U -0.578 0.008 -0.594 -0.583 -0.578
## alpha 14.572 0.022 14.529 14.557 14.572
## sigma 0.277 0.002 0.273 0.275 0.277
## deviance 3740.762 5.667 3731.672 3736.655 3740.140
## 75% 97.5% Rhat n.eff
## Beta_Bathroom 0.100 0.106 1.001 24000
## Beta_Car 0.035 0.039 1.001 19000
## Beta_Distance -0.517 -0.507 1.001 24000
## Beta_Landsize 0.000 0.000 1.001 24000
## Beta_Regionname_East -0.045 -0.009 1.001 24000
## Beta_Regionname_North_Metro -0.371 -0.361 1.001 23000
## Beta_Regionname_North_Vic -0.254 -0.217 1.001 13000
## Beta_Regionname_SE_Metro 0.059 0.075 1.001 24000
## Beta_Regionname_South_Metro 0.177 0.187 1.001 20000
## Beta_Regionname_West_Metro -0.366 -0.356 1.001 24000
## Beta_Regionname_West_Vic -0.523 -0.476 1.001 24000
## Beta_Rooms 0.148 0.153 1.001 19000
## Beta_Type_T -0.235 -0.223 1.001 21000
## Beta_Type_U -0.572 -0.561 1.001 24000
## alpha 14.587 14.616 1.001 24000
## sigma 0.278 0.280 1.001 24000
## deviance 3744.122 3753.531 1.001 24000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 16.1 and DIC = 3756.8
## DIC is an estimate of expected predictive error (lower deviance is better).
Rhat: Rhat is used to verify that the chains have converged well enough to estimate the posterior distribution accurately. A value of 1 is the ideal scenario: it indicates that the chains have converged to the same stationary distribution and that the parameter estimates are reliable. Here all values are essentially 1, which is good.
n.eff: the effective sample size, a measure of the quality of the MCMC samples generated during the estimation process.
MCMC methods generate samples that are not completely independent, because of autocorrelation between consecutive draws. The n.eff value accounts for this autocorrelation, estimating how many fully independent samples would be required to achieve the same precision. Here the effective sample size equals the total number of saved iterations (24,000).
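The near-independence of well-mixed draws can be illustrated with white noise as a stand-in for the chains (simulated, not the actual MCMC output):

```r
set.seed(123)
draws <- rnorm(24000)   # stand-in for 24,000 well-mixed posterior draws
# lag-1 autocorrelation; for independent draws it should be near zero
rho1 <- acf(draws, lag.max = 1, plot = FALSE)$acf[2]
abs(rho1) < 0.05   # negligible, so n.eff ~ number of draws
```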
These are the density plots of my variables:
mcmcplots::denplot(fit_1_2, parms = params, col = c("red","green", "blue"))
## Registered S3 method overwritten by 'mcmcplots':
## method from
## as.mcmc.rjags R2jags
From the density plots of this model we can see that some variables, such as Bathroom and Landsize, appear less useful for the model.
Let’s see how chains behave when sampling statistical parameters:
mcmcplots::traplot(fit_1_2, parms = params, col = c("red","green", "#66b3ff"))
data.mcmc = as.mcmc(fit_1_2)
autocorr.diag(data.mcmc)
## Beta_Bathroom Beta_Car Beta_Distance Beta_Landsize
## Lag 0 1.0000000000 1.000000000 1.0000000000 1.000000000
## Lag 1 0.0008516296 0.008872778 0.0137824271 -0.009023677
## Lag 5 -0.0013080101 -0.002294027 -0.0003496973 0.003950920
## Lag 10 0.0030801285 -0.003450346 -0.0042193961 -0.007119308
## Lag 50 -0.0025463630 0.002058074 0.0060457139 0.014040218
## Beta_Regionname_East Beta_Regionname_North_Metro
## Lag 0 1.0000000000 1.0000000000
## Lag 1 0.0010558213 -0.0032625364
## Lag 5 -0.0006362428 0.0002706934
## Lag 10 -0.0074031912 -0.0122684579
## Lag 50 0.0026185198 0.0024065224
## Beta_Regionname_North_Vic Beta_Regionname_SE_Metro
## Lag 0 1.000000000 1.0000000000
## Lag 1 0.007351084 0.0007478729
## Lag 5 -0.000141573 -0.0083288735
## Lag 10 -0.000165628 -0.0002158231
## Lag 50 0.015353324 0.0028509239
## Beta_Regionname_South_Metro Beta_Regionname_West_Metro
## Lag 0 1.0000000000 1.000000000
## Lag 1 0.0054969990 0.011940043
## Lag 5 0.0009655084 0.002391959
## Lag 10 0.0022862394 -0.002486875
## Lag 50 0.0067923420 -0.003518156
## Beta_Regionname_West_Vic Beta_Rooms Beta_Type_T Beta_Type_U
## Lag 0 1.000000000 1.0000000000 1.000000000 1.00000000000
## Lag 1 -0.004786014 -0.0004956968 -0.006018956 -0.00848528953
## Lag 5 0.006007570 -0.0034507017 0.008325940 0.00004263689
## Lag 10 0.007986878 -0.0095443411 0.014506070 0.00175838359
## Lag 50 0.002596834 0.0047224126 0.020163467 0.00885727367
## alpha deviance sigma
## Lag 0 1.000000000 1.0000000000 1.000000000
## Lag 1 0.011576137 0.0003303936 -0.007437471
## Lag 5 0.007469858 -0.0080743788 0.004336850
## Lag 10 -0.007487659 -0.0107071359 -0.002544111
## Lag 50 -0.001864515 -0.0078958062 0.005843469
In this case, the effective sample size equals, for each parameter, the total number of saved iterations: indeed, the autocorrelation drops to practically zero after the first lag.
Here I keep only the significant variables.
Classic model:
model_2_2 = lm(Price ~ Distance + Regionname + Type, data = houses2)
summary(model_2_2)
##
## Call:
## lm(formula = Price ~ Distance + Regionname + Type, data = houses2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56345 -0.19337 -0.01309 0.18716 2.18556
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 15.1493198 0.0240458 630.019
## Distance -0.4508653 0.0086601 -52.063
## RegionnameEastern Victoria -0.1029587 0.0334747 -3.076
## RegionnameNorthern Metropolitan -0.4416753 0.0093018 -47.483
## RegionnameNorthern Victoria -0.2802714 0.0335434 -8.355
## RegionnameSouth-Eastern Metropolitan 0.0006284 0.0143012 0.044
## RegionnameSouthern Metropolitan 0.1827005 0.0091534 19.960
## RegionnameWestern Metropolitan -0.4119047 0.0092756 -44.407
## RegionnameWestern Victoria -0.6430332 0.0422364 -15.225
## Typet -0.3040194 0.0105579 -28.795
## Typeu -0.8470998 0.0085258 -99.357
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Distance <0.0000000000000002 ***
## RegionnameEastern Victoria 0.0021 **
## RegionnameNorthern Metropolitan <0.0000000000000002 ***
## RegionnameNorthern Victoria <0.0000000000000002 ***
## RegionnameSouth-Eastern Metropolitan 0.9650
## RegionnameSouthern Metropolitan <0.0000000000000002 ***
## RegionnameWestern Metropolitan <0.0000000000000002 ***
## RegionnameWestern Victoria <0.0000000000000002 ***
## Typet <0.0000000000000002 ***
## Typeu <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3261 on 14001 degrees of freedom
## Multiple R-squared: 0.5765, Adjusted R-squared: 0.5762
## F-statistic: 1906 on 10 and 14001 DF, p-value: < 0.00000000000000022
Here the residual standard error (0.3261) is higher than before, which was expected: R-squared tends to decrease when the number of predictors is reduced. The p-value is still very low, and the F-statistic (1906) is lower than in the previous model.
Here too I created the design matrix.
lm2_2 = function(){
# Likelihood:
for (i in 1:N){
Price[i] ~ dnorm(mu[i], tau)
mu[i] = alpha + Beta_Distance * Distance[i] +
Beta_Regionname_East * Region_East[i] +
Beta_Regionname_North_Metro * Region_North_Metro[i] +
Beta_Regionname_North_Vic * Region_North_Vic[i] +
Beta_Regionname_SE_Metro * Region_SE_Metro[i] +
Beta_Regionname_South_Metro * Region_South_Metro[i] +
Beta_Regionname_West_Metro * Region_West_Metro[i] +
Beta_Regionname_West_Vic * Region_West_Vic[i] +
Beta_Type_T * Typet[i] + Beta_Type_U * Typeu[i]
}
# Priors:
alpha ~ dnorm(0, 0.001)
Beta_Distance ~ dnorm(0, 0.001)
tau ~ dgamma(0.0001, 0.0001)
sigma = 1/sqrt(tau)
Beta_Regionname_East ~ dnorm(0, 0.001)
Beta_Regionname_North_Metro ~ dnorm(0, 0.001)
Beta_Regionname_North_Vic ~ dnorm(0, 0.001)
Beta_Regionname_SE_Metro ~ dnorm(0, 0.001)
Beta_Regionname_South_Metro ~ dnorm(0, 0.001)
Beta_Regionname_West_Metro ~ dnorm(0, 0.001)
Beta_Regionname_West_Vic ~ dnorm(0, 0.001)
Beta_Type_T ~ dnorm(0, 0.001)
Beta_Type_U ~ dnorm(0, 0.001)
}
init_values <- function(){
list(alpha = 15.1493198,
Beta_Distance = -0.4508653 ,
tau = 0.0240458,
Beta_Regionname_East = -0.1029587 ,
Beta_Regionname_North_Metro = -0.4416753 ,
Beta_Regionname_North_Vic = -0.2802714 ,
Beta_Regionname_SE_Metro = 0.0006284 ,
Beta_Regionname_South_Metro = 0.1827005 ,
Beta_Regionname_West_Metro = -0.4119047 ,
Beta_Regionname_West_Vic = -0.6430332 ,
Beta_Type_T = -0.3040194 ,
Beta_Type_U = -0.8470998 )
}
params <- c(
"alpha",
"Beta_Distance",
"sigma",
"Beta_Regionname_East",
"Beta_Regionname_North_Metro",
"Beta_Regionname_North_Vic",
"Beta_Regionname_SE_Metro",
"Beta_Regionname_South_Metro",
"Beta_Regionname_West_Metro",
"Beta_Regionname_West_Vic",
"Beta_Type_T",
"Beta_Type_U"
)
data_lm <- list(
N = nrow(covariates),
Price = houses2$Price,
Distance = covariates[, "Distance"],
Region_East = covariates[, "Region_East"],
Region_North_Metro = covariates[, "Region_North_Metro"],
Region_North_Vic = covariates[, "Region_North_Vic"],
Region_SE_Metro = covariates[, "Region_SE_Metro"],
Region_South_Metro = covariates[, "Region_South_Metro"],
Region_West_Metro = covariates[, "Region_West_Metro"],
Region_West_Vic = covariates[, "Region_West_Vic"],
Typet = covariates[, "Typet"],
Typeu = covariates[, "Typeu"]
)
library(R2jags)
fit_2_2 <- jags(
data = data_lm,
inits = init_values,
parameters.to.save = params,
model.file = lm2_2,
n.chains = 3,
n.iter = 10000,
n.burnin = 2000,
n.thin = 1,
DIC = TRUE
)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 14012
## Unobserved stochastic nodes: 12
## Total graph size: 154904
##
## Initializing model
fit_2_2
## Inference for Bugs model at "/var/folders/_s/9tvv9sbd3hld86_pfz5vn_d80000gn/T//RtmpYiwzdM/model59a79a795c2.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 2000 discarded)
## n.sims = 24000 iterations saved
## mu.vect sd.vect 2.5% 25% 50%
## Beta_Distance -0.451 0.009 -0.468 -0.457 -0.451
## Beta_Regionname_East -0.103 0.033 -0.168 -0.125 -0.103
## Beta_Regionname_North_Metro -0.442 0.009 -0.460 -0.448 -0.442
## Beta_Regionname_North_Vic -0.281 0.034 -0.346 -0.303 -0.281
## Beta_Regionname_SE_Metro 0.001 0.014 -0.027 -0.009 0.001
## Beta_Regionname_South_Metro 0.183 0.009 0.165 0.177 0.183
## Beta_Regionname_West_Metro -0.412 0.009 -0.430 -0.418 -0.412
## Beta_Regionname_West_Vic -0.643 0.042 -0.725 -0.672 -0.643
## Beta_Type_T -0.304 0.011 -0.325 -0.311 -0.304
## Beta_Type_U -0.847 0.008 -0.864 -0.853 -0.847
## alpha 15.149 0.024 15.103 15.133 15.149
## sigma 0.326 0.002 0.322 0.325 0.326
## deviance 8359.327 4.906 8351.724 8355.764 8358.646
## 75% 97.5% Rhat n.eff
## Beta_Distance -0.445 -0.434 1.001 24000
## Beta_Regionname_East -0.080 -0.038 1.001 24000
## Beta_Regionname_North_Metro -0.436 -0.424 1.001 24000
## Beta_Regionname_North_Vic -0.258 -0.215 1.001 24000
## Beta_Regionname_SE_Metro 0.010 0.028 1.001 17000
## Beta_Regionname_South_Metro 0.189 0.201 1.001 24000
## Beta_Regionname_West_Metro -0.406 -0.394 1.001 24000
## Beta_Regionname_West_Vic -0.615 -0.560 1.001 24000
## Beta_Type_T -0.297 -0.283 1.001 24000
## Beta_Type_U -0.841 -0.830 1.001 24000
## alpha 15.165 15.196 1.001 24000
## sigma 0.327 0.330 1.001 24000
## deviance 8362.176 8370.705 1.001 24000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 12.0 and DIC = 8371.4
## DIC is an estimate of expected predictive error (lower deviance is better).
We can observe that the parameters are far from zero, which suggests this could be a good model.
As before, Rhat is essentially 1 for every parameter and n.eff equals the total number of saved iterations, indicating good convergence and mixing of the chains.
mcmcplots::denplot(fit_2_2, parms = params, col = c("red","green", "blue"))
These density plots are better than before: we can see that the posterior distributions sit away from zero.
Let’s see how chains behave when sampling statistical parameters:
mcmcplots::traplot(fit_2_2, parms = params, col = c("red","green", "#66b3ff"))
These trace plots show the paths of the chains for each estimated parameter: the chains are stable, and the patterns look consistent, without significant drifts or irregularities.
data.mcmc = as.mcmc(fit_2_2)
autocorr.diag(data.mcmc)
## Beta_Distance Beta_Regionname_East Beta_Regionname_North_Metro
## Lag 0 1.0000000000 1.0000000000 1.000000000
## Lag 1 0.0059516009 -0.0090862085 -0.002640255
## Lag 5 -0.0005723312 -0.0052702714 0.002090457
## Lag 10 -0.0002191763 -0.0033712309 0.003932570
## Lag 50 -0.0102337502 -0.0007025339 -0.010460763
## Beta_Regionname_North_Vic Beta_Regionname_SE_Metro
## Lag 0 1.0000000000 1.000000000
## Lag 1 -0.0098239603 -0.002045862
## Lag 5 0.0005645275 -0.005571314
## Lag 10 -0.0089595783 -0.001942899
## Lag 50 0.0074749351 -0.008669531
## Beta_Regionname_South_Metro Beta_Regionname_West_Metro
## Lag 0 1.000000000 1.0000000000
## Lag 1 -0.009641407 -0.0100202657
## Lag 5 0.010426582 0.0005028016
## Lag 10 0.014109591 0.0088913028
## Lag 50 0.004054867 0.0003817099
## Beta_Regionname_West_Vic Beta_Type_T Beta_Type_U alpha
## Lag 0 1.000000000 1.000000000 1.000000000 1.0000000000
## Lag 1 -0.005387491 0.001412278 -0.005698463 0.0046564591
## Lag 5 0.004016464 -0.003962876 -0.002138771 0.0009853653
## Lag 10 0.002702652 -0.005223812 0.004768854 0.0003697222
## Lag 50 0.005231720 -0.008619071 0.001120265 -0.0055470904
## deviance sigma
## Lag 0 1.000000000 1.000000000
## Lag 1 -0.008311503 -0.004644410
## Lag 5 -0.005393054 0.013042658
## Lag 10 -0.004148114 0.002868564
## Lag 50 0.002163484 -0.013278013
In this case too, the effective sample size equals, for each parameter, the total number of saved iterations: the autocorrelation drops to practically zero after the first lag.
model_1_2_3 = lm(Price ~ Distance + Regionname + Type, data = houses)
summary(model_1_2_3)
##
## Call:
## lm(formula = Price ~ Distance + Regionname + Type, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.55010 -0.20021 -0.01462 0.19144 2.11235
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 14.4139457 0.0116968 1232.296
## Distance -0.0310859 0.0006225 -49.937
## RegionnameEastern Victoria 0.1112993 0.0350054 3.179
## RegionnameNorthern Metropolitan -0.4302412 0.0093398 -46.065
## RegionnameNorthern Victoria -0.0677398 0.0350084 -1.935
## RegionnameSouth-Eastern Metropolitan 0.0708590 0.0149491 4.740
## RegionnameSouthern Metropolitan 0.1957700 0.0091687 21.352
## RegionnameWestern Metropolitan -0.3844181 0.0092208 -41.690
## RegionnameWestern Victoria -0.4958935 0.0431338 -11.497
## Typet -0.3059461 0.0106299 -28.782
## Typeu -0.8333361 0.0085478 -97.491
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Distance < 0.0000000000000002 ***
## RegionnameEastern Victoria 0.00148 **
## RegionnameNorthern Metropolitan < 0.0000000000000002 ***
## RegionnameNorthern Victoria 0.05302 .
## RegionnameSouth-Eastern Metropolitan 0.00000216 ***
## RegionnameSouthern Metropolitan < 0.0000000000000002 ***
## RegionnameWestern Metropolitan < 0.0000000000000002 ***
## RegionnameWestern Victoria < 0.0000000000000002 ***
## Typet < 0.0000000000000002 ***
## Typeu < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3282 on 14001 degrees of freedom
## Multiple R-squared: 0.571, Adjusted R-squared: 0.5706
## F-statistic: 1863 on 10 and 14001 DF, p-value: < 0.00000000000000022
Here the residual standard error (0.3282) is quite similar to that of the previous model, while the F-statistic (1863) is lower than before, indicating a slightly weaker fit; the p-value remains essentially zero.
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 14012
## Unobserved stochastic nodes: 12
## Total graph size: 154904
##
## Initializing model
fit_1_2_3
## Inference for Bugs model at "/var/folders/_s/9tvv9sbd3hld86_pfz5vn_d80000gn/T//RtmpYiwzdM/model59a6bbfbc59.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 2000 discarded)
## n.sims = 24000 iterations saved
## mu.vect sd.vect 2.5% 25% 50%
## Beta_Distance -0.031 0.001 -0.032 -0.032 -0.031
## Beta_Regionname_East 0.112 0.035 0.043 0.088 0.112
## Beta_Regionname_North_Metro -0.430 0.009 -0.448 -0.437 -0.430
## Beta_Regionname_North_Vic -0.068 0.035 -0.136 -0.091 -0.067
## Beta_Regionname_SE_Metro 0.071 0.015 0.042 0.061 0.071
## Beta_Regionname_South_Metro 0.196 0.009 0.178 0.190 0.196
## Beta_Regionname_West_Metro -0.384 0.009 -0.403 -0.391 -0.384
## Beta_Regionname_West_Vic -0.496 0.043 -0.580 -0.525 -0.496
## Beta_Type_T -0.306 0.011 -0.327 -0.313 -0.306
## Beta_Type_U -0.833 0.009 -0.850 -0.839 -0.833
## alpha 14.414 0.012 14.391 14.406 14.414
## sigma 0.328 0.002 0.324 0.327 0.328
## deviance 8542.292 4.908 8534.691 8538.715 8541.615
## 75% 97.5% Rhat n.eff
## Beta_Distance -0.031 -0.030 1.001 24000
## Beta_Regionname_East 0.135 0.180 1.001 24000
## Beta_Regionname_North_Metro -0.424 -0.412 1.001 24000
## Beta_Regionname_North_Vic -0.044 0.001 1.001 24000
## Beta_Regionname_SE_Metro 0.081 0.101 1.001 24000
## Beta_Regionname_South_Metro 0.202 0.214 1.001 24000
## Beta_Regionname_West_Metro -0.378 -0.366 1.001 24000
## Beta_Regionname_West_Vic -0.467 -0.411 1.001 24000
## Beta_Type_T -0.299 -0.285 1.001 8500
## Beta_Type_U -0.828 -0.816 1.001 10000
## alpha 14.422 14.437 1.001 24000
## sigma 0.330 0.332 1.001 8500
## deviance 8545.194 8553.563 1.001 24000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 12.0 and DIC = 8554.3
## DIC is an estimate of expected predictive error (lower deviance is better).
What we can observe from the model output is that the coefficient of Distance is close to zero, which was expected since here we do not apply the log-transformation (in the other model, with the transformation, the coefficient was farther from zero). The DIC is also higher than before, which again suggests that this model does not fit as well as the previous one.
Rhat: Rhat is a diagnostic used to verify that the chains have converged sufficiently to estimate the posterior distribution accurately. A value equal to 1 is the ideal scenario: it indicates that the chains have converged to the same stationary distribution and that the parameter estimates are reliable. In our case all Rhat values are essentially 1, which is good!
n.eff: This is the effective sample size, a measure of the quality of the MCMC samples generated during the estimation process.
MCMC methods produce samples that are not fully independent, because of autocorrelation between consecutive draws. The n.eff value accounts for this autocorrelation by estimating how many fully independent samples would be required to achieve the same precision. Here the effective sample size is essentially equal to the total number of saved iterations, indicating nearly independent draws.
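As a quick cross-check, the effective sample sizes reported above can be recomputed directly with the coda package (a minimal sketch, assuming `fit_1_2_3` is the jags object fitted above):

```r
# Effective sample size per parameter via coda; values close to
# n.sims (24000 here) indicate nearly independent posterior draws
library(coda)
effectiveSize(as.mcmc(fit_1_2_3))
```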
mcmcplots::denplot(fit_1_2_3, parms = params, col = c("red","green", "blue"))
Let's see how the chains behave while sampling the model parameters.
mcmcplots::traplot(fit_1_2_3, parms = params, col = c("red","green", "#66b3ff"))
The Deviance Information Criterion (DIC) is a measure used to compare the goodness of fit of Bayesian models. It is based on the deviance of the data under each model and penalises more complex models. A lower DIC value indicates a better fit of the model to the data. Comparing the three values, we prefer the first model.
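The rule quoted in the JAGS output above, pD = var(deviance)/2, can be reproduced by hand from the saved deviance draws (a sketch assuming the R2jags `BUGSoutput` structure of `fit_1_2_3`):

```r
# Recompute pD and DIC from the stored deviance samples
dev <- fit_1_2_3$BUGSoutput$sims.list$deviance
pD  <- var(as.vector(dev)) / 2   # effective number of parameters
DIC <- mean(dev) + pD            # lower DIC = better expected predictive fit
c(pD = pD, DIC = DIC)            # should roughly match pD = 12.0, DIC = 8554.3
```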
data.mcmc = as.mcmc(fit_1_2_3)
autocorr.diag(data.mcmc)
## Beta_Distance Beta_Regionname_East Beta_Regionname_North_Metro
## Lag 0 1.0000000000 1.000000000 1.000000000
## Lag 1 0.0005773426 0.004963718 0.006566662
## Lag 5 0.0042457932 -0.007574358 -0.005228373
## Lag 10 -0.0054802582 -0.009533484 -0.003354560
## Lag 50 -0.0155701867 0.002298049 -0.001153344
## Beta_Regionname_North_Vic Beta_Regionname_SE_Metro
## Lag 0 1.000000000 1.00000000000
## Lag 1 -0.003034786 0.00003230321
## Lag 5 0.009875955 -0.01227922436
## Lag 10 -0.003600110 -0.00824171542
## Lag 50 0.007016904 0.00571241925
## Beta_Regionname_South_Metro Beta_Regionname_West_Metro
## Lag 0 1.000000000 1.000000000
## Lag 1 -0.001951669 -0.012802177
## Lag 5 -0.004968893 0.001454240
## Lag 10 -0.000141767 -0.003782652
## Lag 50 -0.001529814 0.002575784
## Beta_Regionname_West_Vic Beta_Type_T Beta_Type_U alpha
## Lag 0 1.000000000 1.0000000000 1.000000000 1.000000000
## Lag 1 0.007304329 -0.0043162735 -0.008369544 -0.003751995
## Lag 5 -0.001232082 0.0054484873 -0.006247572 0.005358433
## Lag 10 -0.000173676 0.0039731407 -0.002043820 0.001739888
## Lag 50 -0.010433098 0.0009331902 -0.009830853 -0.008112180
## deviance sigma
## Lag 0 1.00000000000 1.000000000
## Lag 1 0.00009866016 0.012157888
## Lag 5 0.00656531771 -0.008126648
## Lag 10 0.00454312129 -0.004329357
## Lag 50 0.00254325426 0.013300870
In this case, the effective sample size equals, for each parameter, the total number of iterations: indeed the autocorrelation drops to practically zero already at the first lag.
fit_1_2$BUGSoutput$DIC
## [1] 3756.82
fit_2_2$BUGSoutput$DIC
## [1] 8371.363
fit_1_2_3$BUGSoutput$DIC
## [1] 8554.335
In the model with all variables and the logarithmic scale applied to "Distance," I obtained a DIC value of 3756.82. This indicates that the model fits the data well compared to the other models. Including more variables in a regression model tends to increase the amount of variance explained, because each added variable can potentially contribute to explaining variation in the response variable.
In the second model, in which I removed some insignificant variables but kept the logarithmic scale for "Distance," the DIC value is 8371.363. This is considerably higher than in the full model, which suggests that removing the non-significant variables reduced the model's fit to the data.
The logarithmic transformation improved the model's fit to the observed data; this transformation can help reduce non-uniform variation in the data, making the model more stable and reliable.
In the third model, I kept only the most significant variables but removed the logarithmic scale from "Distance", obtaining a DIC value of 8554.335. This is the highest among the three models, suggesting that the model with untransformed "Distance" fits the data least well. When "Distance" is on a logarithmic scale, the density plot shows a distribution of the coefficient farther from zero, indicating that the variable has a greater ability to explain variation in the response.
Relying on the DIC alone is not a good way to judge the actual quality of a model.
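A complementary diagnostic is a posterior predictive check: simulate replicated responses from the posterior draws and compare a summary statistic with the observed one. The sketch below is purely illustrative; it assumes a design matrix `X` (with columns ordered to match the Beta coefficients) and the observed log-price vector `y` are available in the workspace:

```r
# Hypothetical posterior predictive check on the mean of log-price
sims  <- fit_1_2_3$BUGSoutput$sims.matrix
idx   <- sample(nrow(sims), 200)                   # subsample of draws
betas <- grep("^Beta", colnames(sims))
rep_means <- sapply(idx, function(i) {
  mu <- sims[i, "alpha"] + as.vector(X %*% sims[i, betas])
  mean(rnorm(length(y), mu, sims[i, "sigma"]))     # one replicated dataset
})
mean(rep_means > mean(y))  # predictive p-value; far from 0.5 signals misfit
```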
In comparing the frequentist and Bayesian approaches, we find that they often produce similar results. Frequentist methods rely on data likelihood for inference, while Bayesian methods incorporate prior beliefs and update them with data. While they can align when prior beliefs are weak, discrepancies may arise when there are significant prior-data differences. It’s crucial to choose the approach that suits the problem and consider their philosophical and practical distinctions.